Editing and imputing are both methods of data processing. Editing refers to the detection and correction of errors in the data. Whilst imputation is a method of correcting errors and estimating and filling in missing values in a dataset. Where there are errors in the dataset, and when these are considered to add no value in the correction process, these values are set to missing and are imputed with a plausible estimate. Alternatively, missing values may already exist in the data, and imputation may be carried out to produce a complete dataset for analysis.
This research project evaluated the use of machine learning methods for imputation. In order to provide a context for using machine learning in the imputation process, the reader is presented with:
Missingness and erroneous values can impact the quality of data. A large volume of incorrect and/or missing values increase the risk of the product failing to capture the target concept or target population. That is, ommissions (introduced in collection or processing) may result in certain sub-groups of the target population from being excluded in the analysis dataset, and in turn increasing the risk of biased estimates, reducing the power of inferential statistics and increasing the uncertainty of estimates and inferences derived from the data. Similarly, errors in a dataset may impact the degree to which the final estimate or output represents the reality it was designed to capture.
Correcting erroneous responses and filling in missing values helps manage the quality of data. A complete dataset can improve the accuracy and reliability of estimates, and help maintain the consistency of counts across published tables. Moreover, a dataset with fewer errors and more units may more accuately capture the underlying distribution of the variable of interest. Selecting a method for estimating values in a datset is generally advised by the nature of errors or missingness in the data, and the output desired from the analysis dataset.
An imputation process of a dataset can be broken down into the following three phases:
The focus of this project was in applying Machine Learning methods to amend values in a dataset. That is, it was of interest to compare existing approaches, of treating missing or erroneous values by estimating replacement figures, to machine learning methods. Methods of variable amendment can be grouped into one of the following categories:
The mechanisms for a given imputation method could be deterministic or stochastic. The former refers to instances where repeated trials of the same method yield identical output. Whereas the latter refers to instances where there is element of randomness; repeated iterations will produce different results.
Interactive treatment refers to a class of methods whereby the data are adjusted by a human editor by either re-contacting the respondent/ data provider, replacing values from another variable/ data source, or creating a value based on subject matter expertise.
Deductive imputation uses logic or an understanding about the relationship between variables and units to fill in missing values. Examples include deriving a value as a function of other values, adopting a value from a related unit, and adopting a value from an earlier time point. Generally, this method is used when the true value can be derived with certainty or with a very high probability.
Model based imputation refers to a class of methods that estimate missing values using assumptions about the distribution of the data, which include mean and median imputation. Or assumptions about the relationship between auxiliary variables (or x variables) and the target y variable to predict missing values.
Donor based imputation adopt values from an observed unit, which are then used for the missing unit. For each recipient with a missing value for variable y, a donor is identified that is similar to the recipient with respect to certain background characteristics (often referred to as matching variables) that are related to the target variable y. Such methods are relatively easy to apply when there are several related missing values in one record, and if the intention is to preserve the relationship between variables.
Machine learning is the field of study that enables a program to learn from its experience of iterating through a task multiple times. A performance measure is generally specified by the programmer, which is used to evaluate how well the machine has carried out the task at each iteration. Learning of the task is evidenced by its improvement against the performance measure.
The different types of machine learning systems can be categorised with respect to:
Machine learning systems can vary with regards to the degree of supervision. The major types of supervision:
Supervised learning is the specification of the desired output. That is, the data used to train the model includes the solutions (which are referred to as labels), which the machine learning system attempts to estimate. The desired solutions specified in the machine learning algorithm are referred to as labels.
Unsupervised learning uses training data that is unlabelled. In this class of machine learning systems, the outcome/ desired solutions are not specified in the machine learning algorithm.
Machine learning systems that use partially labelled data are categorised as utilising semi-supervised learning.
Reinforcement learning involves the use of rewards or penalties to train the machine in identifying the appropriate action for a given situation. The learning system, which is referred to as an agent, observes the environment, selects and performs actions, and gets a response in the form of a reward or penalty. After multiple iterations, it identifies the best strategy, referred to as a policy, that results in the most reward over time.
Another criterion for classifying machine learning systems is the way in which the algorithm learns. That is, whether the learning takes place at once or if it happens in increments.
Batch learning uses all the available data to train the machine learning system. This is generally time consuming and computationally expensive, and as a result is carried out offline. Whilst in production, the system is no longer learning, and is simply applying what it has learnt from the full set of training data.
Any changes to the data generating mechanism (GDM) will mean that a new system would need to be trained, from scratch on the full set of data (that includes data points before and after changes to the GDM).
Online learning trains the system incrementally through sequential input of data. Data can be delivered individually or in small groups, referred to as mini-batches. As each learning step is relatively fast and cheap, the system can learn about new data whilst in production, as it arrives. It is an ideal approach for when the velocity of new data is high, and when there is a need to adapt to changes rapidly or autonomously.
Machine learning systems can also be categorised with regards to how the systems generalise. That is, there are different approaches to using the training data to develop a system that can then be generalised to new cases. The two main approaches are instance-based learning and model-based learning.
Instance-based learning identifies all instances of a given feature in the training data and uses a similarity measure to generalise to new cases.
Model-based learning uses features in the training data to predict the outcome/ variable of interest; the model used to specify the relationship between the predictor(s) and outcome(s) are then generalised on new cases.
It was of interest to explore the utility of Machine Learning to directly impute for missing values in datasets. More specifically, the Methods Division was interested in examining whether Machine Learning models can improve the timeliness, reliability and accuracy of the imputation process in social survey data. Figure 1 presents the imputation pipeline for social survey data. Prior to imputation, units and values are reviewed, and those that are missing and should be routed to the item in question, are selected (i.e. flagged) for imputation. Data is then further processed by the Social Survey Division before imputed and observed data are compiled in an analysis dataset, used for publishing Official Statistics estimates.
Figure 1. Imputation pipeline in social survey data.
The intention was to use a machine learning system to impute flagged missing values. This model based approach for imputation may reduce the data processing time and improve the precision and reduce the variance of estimates. The current approach, which utilises nearest neighbour donor imputation involves the following:
Designing the selection criteria for donors can be time consuming as it requires analysts to identify matching variables (MV), along with weights for each MV. Teams currently use subject matter expertise in designing the donor imputation strategy for each variable. As this process is not data driven, it introduces an element of subjectivity and does not guarantee that matching variables selected are the best predictors of the variable of interest. In contrast, a data driven approach would be reproducible and identify the best predictors, in the dataset, to estimate missing values. Moreover, applying the machine learning system may offer a more parsimonious approach as fewer input paramters and files would be required in executing imputation.
The Methodology Division was interested in whether a Machine Learning System would perform better compared to the current imputation process with regards to:
The following Machine Learning tools were used in the investigation:
XGBoost is a set of open source functions and steps, referred to as a library, that use supervised ML where analysts specify an outcome to be estiamted/ predicted. The XGBoost library uses multiple decision trees to predict an outcome.
The ML system is trained using batch learning and generalised through a model based approach. It uses all available data to construct a model that specifies the relationship between the predictor and outcome variables, which are then generalised on the test data.
XGBoost stands for eXtreme Gradient Boosting. The word “extreme” reflects its goal to push the limit of computational resources. Whereas gradient boosting is a machine learning technique for regression and classification problems that optimises a collection of weak prediction models in an attempt to build an accurate and reliable predictor.
In order to build a better understanding of how XGBoost works, the documentation will briefly review:
Decision trees can be used as a method for grouping units in a dataset by asking questions, such as “Does an individual have a Bachelor’s degree?”. In this example, two groups would be created; one for those with a Bachelor’s degree, and one for those without. Figure 2 provides a visual depiction of this grouping in an attempt to explain Income.
Figure 2. Decision tree that splits units in a dataset based on whether individual has a Bachelor’s degree or not, in order to predict Income. The tree shows that those with a Bachelors degree on average earn more than than those wihtout a Bachelor’s degree.
Each subsequent question in a decision tree will produce a smaller group of units. This grouping is carried out to identify units with similar characteristics with respect to an outcome variable. The model in Figure 3 attempts to use University qualifications to predict Income.
Figure 3. Decision tree that splits units in a dataset based on whether individual has a Bachelor’s degree (yes/no), a Master’s degree and pHD (yes/no), in order to predict Income. The tree shows that those with a higher qualification tend to earn more.
The following characteristics are true of decision trees:
Decision trees are a learning method that involve a tree like graph to model either continuous or categorical choice given some data. It is composed of a series of binary questions, which when answered in succession yield a prediciton about data at hand. XGBoost uses Classification and Regression Trees (CART), which are presented in the above examples, to predict the outcome variable.
A single decision tree is considered a weak/ base learner as it only slightly better than chance at predicting the outcome variable. Whereas strong learners are any algorithm that can be tuned to achieve peak performance for supervised learning. XGBoost uses decision trees as base learners; combining many weak learners to form a strong learner. As a result it is referred to as an ensemble learning method; using the output of many models (i.e. trees) in the final prediction.
The concept of combining many weak learners to produce a strong learner is referred as boosting. XGBoost will iteratively build a set of weak models on subsets of the data; weighting each weak prediction according to the weak learner’s performance. A prediction is derived by taking the weighted sum of all base learners.
In the training data, a target variable \(y_{i}\) is specified, whilst all other features \(x_{i}\) are used as predictors of the target variable. A collection of decision trees are used to predict values of \(y_{i}\) using \(x_{i}\). Individually, each decision tree, would be a weak predictor of the outcome variable. However, as a collective, the decision trees may enable analysts to make accurate and reliable predictions of \(y_{i}\). As a result, the method for predicting the target variable using \(x_{i}\) in XGBoost is referred to as decision tree ensembles. The steps below demonstrate how XGBoost was used to build a model, to predict income, using Univeristy Qualifications.
library(caret)
library(xgboost)#### Load data ####
load("data/Income_tree.RData")
#### Remove identifier ####
Income <- Income[,-1]#### Split data into training and test ####
set.seed(5)
s <- createDataPartition(Income$income, p = 0.8, list=FALSE)
training <- Income[s,]
test <- Income[-s,]#### Convert the data to a matrix and assign output variable first ####
train.outcome <- training$income
train.predictors <- sparse.model.matrix(income ~ .,
data = training
)[, -1]
test.outcome <- test$income
test.predictors <- model.matrix(income ~ .,
data = test
)[, -1]
#### Convert the matrix objects to DMatrix objects ####
dtrain <- xgb.DMatrix(train.predictors, label=train.outcome)
dtest <- xgb.DMatrix(test.predictors)#### Train the model ####
model <- xgboost(
data = dtrain, max_depth = 2, eta = 1, nthread = 2, nrounds = 10,
objective = "reg:linear")#### Test the model ####
pred <- predict(model, dtest)
#### Evaluate the performance of model ####
RMSE(pred,test.outcome)#### Examine feature importance ####
importance_matrix <- xgb.importance(model = model)
print(importance_matrix)
xgb.plot.importance(importance_matrix = importance_matrix)# Tree 1
xgb.plot.tree(model = model, tree=0)
# Tree 2
xgb.plot.tree(model = model, tree=1)
# Tree 3
xgb.plot.tree(model = model, tree=2)The project evaluated the machine learning methods using:
1) The Census Teaching, a open dataset containing 1% of the person records from the 2011 Census in England & Wales.
2) Survey Data
The code below specifies the packages used in the preparation, study and build of machine learning systems using the Census Teaching File.
library(tidyverse)
library(mice)
library(reshape2)
library(GGally)
library(Matrix)
library(xgboost)
library(caret)
library(DiagrammeR)
library(MLmetrics)
library(rpart)
library(scales)
library(knitr)
library(kableExtra)The Census Teaching File was downloaded from the ONS website as a CSV file named “CensusTeachingFile”, and was read into R using the following line of code. The dataset consisted of 569,741 individuals and 18 categorical variables from the 2011 Census population.
# Read CSV into R
CensusRaw <- read.csv(
file = "Data/CensusTeachingFile.csv", skip = 1,
header = TRUE, sep = ","
)Variables in the dataset were renamed and recoded so that:
# Rename variables
Census <- plyr::rename(CensusRaw, c(
"Person.ID" = "person.id",
"Region" = "region",
"Residence.Type" = "residence.type",
"Family.Composition" = "fam.comp",
"Population.Base" = "resident.type",
"Sex"="sex",
"Age"="age",
"Marital.Status" = "marital.status",
"Student"="student",
"Country.of.Birth" = "birth.country",
"Health"="health",
"Ethnic.Group" = "ethnicity",
"Religion"="religion",
"Economic.Activity" = "econ.act",
"Occupation" = "occupation",
"Industry" = "industry",
"Hours.worked.per.week" = "hours.worked",
"Approximated.Social.Grade" = "social.grade"
))
# Recode variables
Census %>% mutate_if(is.factor, as.character) -> Census
# Recode the Region variable so that it is numeric
Census$region[Census$region == "E12000001"] <- 1
Census$region[Census$region == "E12000002"] <- 2
Census$region[Census$region == "E12000003"] <- 3
Census$region[Census$region == "E12000004"] <- 4
Census$region[Census$region == "E12000005"] <- 5
Census$region[Census$region == "E12000006"] <- 6
Census$region[Census$region == "E12000007"] <- 7
Census$region[Census$region == "E12000008"] <- 8
Census$region[Census$region == "E12000009"] <- 9
Census$region[Census$region == "W92000004"] <- 10
Census$residence.type[Census$residence.type == "C"] <- 1
Census$residence.type[Census$residence.type == "H"] <- 2
Census$student[Census$student == 1] <- 0
Census$student[Census$student == 2] <- 1
Census %>% mutate_if(is.character, as.numeric) -> Census
Census$person.id <- as.character(Census$person.id)
Ht <- table(Census$hours.worked)
Census$hours.cont <- ifelse(Census$hours.worked == 1, runif(
1:Ht[names(Ht) == 1],
1, 15
),
ifelse(Census$hours.worked == 2, runif(1:Ht[names(Ht) == 2], 16, 30),
ifelse(Census$hours.worked == 3, runif(1:Ht[names(Ht) == 3], 31, 48),
ifelse(Census$hours.worked == 4, runif(1:Ht[names(Ht) == 4], 49, 60),
Census$hours.worked
)
)
)
)
save(Census, file = "data/Census.Rda")A preview of the dataset is provided below.
For the purposes of training and testing a machine learning system, the data was divided into training and test datasets using the following code.
# Randomly select 80% of Census units and split into Train and Test data
set.seed(5)
Census80 <- sample(1:nrow(Census), 0.8 * nrow(Census), replace = FALSE)
Census20 <- setdiff(1:nrow(Census), Census80)
Census.train <- Census[Census80, ]
Census.test <- Census[Census20, ]
save(Census.train, file = "data/Census.train.Rda")
save(Census.test, file = "data/Census.test.Rda")The intention was to build models to predict each variable (listed in Table 1) using training data, which had no missingness. This model would then be evaluated with respect to its accuracy and generalisability using a test dataset, which would have missingness. The Census Teaching File was a complete dataset. As a result, missingness was simulated in the test dataset, and the imputation models (derived for each variable) were evaluated with regards to how well they predicted the true values.
Table 1. Variables to impute in Census Teaching File
Summary statistics were produced to review the pattern of responses of individuals included in the dataset. Please note, that all comments that refer to respondents reflect only the respondents included in the 2011 Census Teaching File, and not descriptive statistics for the Census Population as a whole. Summary statistics produced using the dataset show that:
# Study the data: 1) How many units, 2) How many attributes? and
# 3) How many missing units?
str(Census)
summary(Census)
sapply(apply(Census[, c(-1, -19)], 2, table), function(x) x / sum(x))
missing_values <- sapply(Census, function(y) sum(length(which(is.na(y)))))
missing_values <- data.frame(missing_values)
# Look for relationship between variables
ggcorr(Census[, -1],
nbreaks = 8, palette = "RdGy",
label = TRUE, label_size = 3, label_color = "white"
)Bar charts were used to review the distribution of responses for each categorical variable in the complete, training and test datasets (see Figures 2 to 4). As expected, there was a similar response pattern for a each variable between the complete, training and test datasets. The correlation matrix in Figure 5 presents the relationship between the variables in the complete dataset.
# Compare the test and training datasets
str(Census.train)
str(Census.test)
summary(Census.train)
summary(Census.test)
sapply(apply(Census.train[, c(-1, -19)], 2, table), function(x) x / sum(x))
sapply(apply(Census.test[, c(-1, -19)], 2, table), function(x) x / sum(x))
# Plot distribution of variables
melt.Census <- melt(Census)
head(melt.Census)
ggplot(data = melt.Census, aes(x = value)) +
geom_bar() +
facet_wrap(~variable, scales = "free")
# Plot distribution of variables
melt.Census.train <- melt(Census.train)
head(melt.Census.train)
ggplot(data = melt.Census.train, aes(x = value)) +
geom_bar() +
facet_wrap(~variable, scales = "free")
melt.Census.test <- melt(Census.test)
head(melt.Census.test)
ggplot(data = melt.Census.test, aes(x = value)) +
geom_bar() +
facet_wrap(~variable, scales = "free")Figure 2. Distribution of responses for variables in complete Census Teaching File
Figure 3. Distribution of responses for variables in training dataset
Figure 4. Distribution of responses for variables in test dataset
Figure 5. Correlation matrix of variables in the Census dataset